How do VRCodes reboot smartcodes?
=A Short Answer=
----
Small, inexpensive, robust networked processing devices, distributed at all scales, have come to dominate everyday life in the recent years of the ubiquitous computing era. Among these devices are mobile phones equipped with high-quality sensors, high-end displays, and digital cameras. With the spread of these technologies, people's everyday environment has also changed considerably. Wireless handheld computers running different operating systems and equipped with a variety of sensors (e.g., accelerometers, gyroscopes, high-end digital cameras) and software applications offer massive amounts of visual information to the everyday user regardless of time and location. Digital cameras, from professional SLRs to cellphone cameras, are now ubiquitous and are used not only for photography but also for accessing information by recognizing and understanding scene elements or capturing machine-interpretable data.

Along with the proliferation of active displays in public spaces, adding intelligence to objects via tags and building human or machine interactions around them greatly simplifies navigation and data transfer applications. To this end, self-identifying tag technologies such as barcodes and RFIDs greatly simplify the recognition problem. However, typical barcodes pose two problems:
# they are large in size relative to the amount of information they carry, and
# they must be carefully designed for readability by simple devices that usually work in close proximity to the barcode.
The problem of creating tags to supplement the physical world has given rise to many passive as well as active solutions, which use projectors, Bluetooth, LEDs and even IP. Typical barcodes were intended to identify objects on a printed medium or on tagged objects.
However, to increase their information capacity, barcodes need to be printed larger and to use more dimensions, which immediately constrains the relative position and orientation of the tag. Barcodes therefore remain large relative to the amount of information they carry and must be carefully designed for readability by a simple device that usually works in close proximity. Even two-dimensional fiducial markers used in Augmented Reality (AR), robot navigation, and other applications suffer from the same problem.

Smart future environments, though, require efficient machine interaction with aesthetically pleasing visual tagging that is compatible with digital cameras. Envision a world where inconspicuous and unobtrusive display surfaces transmit words and pictures as well as machine-compatible data. Unfortunately, barcodes are visible to the human eye and take up precious visual real estate. There is therefore a need to enhance human-machine interaction without further cluttering the human-visible space.

Video Response Codes (VRCodes) (G. Woo, A. Lippman, R. Raskar, "VRCodes: Unobtrusive and active visual codes for interaction by exploiting rolling shutter," IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 59-64, 2012) propose such an improvement by employing commodity digital cameras that can decode information invisible to humans. They are part of a framework named Newsflash, introduced by the MIT Media Lab in early 2012 ("NewsFlash replaces QR codes with invisible flashes of light," The Verge, April 2012; "MIT NewsFlash Uses Light as Alternative to QR Codes"; "Newsflash uses high-frequency light to transmit data from iPad to smartphone, we go hands-on," Engadget, April 2012; "QR to VR: The smartcode rebooted," Wired UK Magazine, Nov. 2012), that allows independent displays and cameras to be turned into an interactive environment.
In particular, they are based on a novel visible-light communications architecture that allows codes embedded in pictures to be detected and decoded by an inexpensive, off-the-shelf camera.

Barcodes are decoded using a flying-spot scanning laser and a single photodetector, which identifies the absence or presence of light reflected from the 0-1 stripes on the barcode (A. Q. Morton, "Packaging history: The emergence of the Uniform Product Code (UPC) in the United States," History and Technology: An Intl. Journal, pp. 101-111, 1994). Newer codes exploit the 2D imaging of advanced scanners and pack more information (T. Pavlidis, J. Swartz, and Y. P. Wang, "Fundamentals of bar code information theory," IEEE Computer 23(4): 74-86, April 1990). They include Data Matrix (ISO 2006, "Data Matrix bar code symbology specification," ISO/IEC, 16022, 2006), QR codes (ISO 2006, "QR code 2005 bar code symbology specification," ISO/IEC, 18004, 2006), and Aztec codes, all of which use Reed-Solomon error correction (D. J. C. MacKay, "Information Theory, Inference, and Learning Algorithms," Cambridge University Press). Multiplexed barcodes (T. Langlotz and O. Bimber, "Unsynchronized 4D barcodes", in ISVC, 363-374, 2007) use multiple color channels and temporally changing codes (shown on an LCD or a projector) to maximize the data throughput and the robustness of barcode recognition.

Radio Frequency Identification (RFID) tags (R. Want, "A key to automating everything," Scientific American, 2003) are used to determine the presence of an object within a certain range, but do not reveal its location. They suffer from a lack of sufficient directionality and from interference with neighboring tags. The reading distance of passive RFIDs may be very limited compared to that of active RFIDs, which, however, require an on-board power source. They also have significant security issues.
Moreover, in the case of RFIDs, many phones do come with a near-field communication (NFC) reader but cannot extract relative position and orientation. Active visual codes (T. Langlotz and O. Bimber, "Unsynchronized 4d barcodes: coding and decoding time-multiplexed 2d colorcodes," in Proceedings of the 3rd Intl. Conf. on Advances in Visual Computing, pp. 363-374, 2007; A. Grundhofer, M. Seeger, F. Hantsch, and O. Bimber, "Dynamic adaptation of projected interceptible codes", in ISMAR, pp. 1-10, 2007) use multiple color channels and temporally changing codes displayed on either an LCD screen or a projector to maximize both data throughput and the robustness of barcode recognition with respect to a camera. Such codes contain much more information and can be read by a well-designed decoder on a camera.

In contrast to all of the above methods, VRCodes provide a decoding mechanism in a system built from ordinary displays and off-the-shelf rolling-shutter cameras, which can be found in the majority of everyday mobile devices. Furthermore, their design includes:
# methods to spatio-temporally embed digital data in active displays without interfering with what the human eye sees, and
# an encoding technique that allows recovery of the data, and of the relative distance and angle of the camera, without any artifacts.
The number of perceived colors in VRCodes is limited by the available display parameters, while dynamic range is improved by using a higher-end 120Hz screen. Moreover, VRCodes provide distance and angle identification for each frame. Their design is based on the flicker fusion threshold of the human eye in combination with the behavior of a rolling-shutter camera. The motivation for modeling the behavior of the rolling-shutter camera comes from the proposal of G. Hong, R. Luo, and P. Rhodes ("A study of digital camera colorimetric characterization based on polynomial modeling") to remove the metameric effects in displays.
Digital data embedding in active displays is based on the opportunistic use of the moire pattern, which can be seen only by the camera and not by the human eye. This approach is similar to the microscopic analysis of pixels that humans cannot resolve; one can also imagine a light-modulating light bulb flickering beyond a perceivable rate.

----
=A Long Answer=
----
The short answer presented the limitations and lack of flexibility of today's proximal and vision-based communication systems, which stem from their reliance on specific hardware features. For example, visual barcodes require standardization across all devices and at the same time work poorly from afar. VRCodes present a novel hue-based barcode design that can be read by a rolling-shutter camera at consumer-grade capture speeds but remains unobtrusive to the human eye. These codes are used to embed not only data but also relative position and orientation. Complementary hues are switched temporally at 60Hz (beyond the typical fusion frequency of the human eye) on a commodity active screen, so that a human sees a pure gray screen whereas an ordinary 30fps mobile camera sees the effects of the changing hues.

---- Effective Rolling Shutter Speed ----

Current surveillance video cameras and mobile cameras use CMOS image sensors, which are simple digital light collectors designed to behave similarly to an analog camera. Unlike the Human Visual System (HVS), they read pixels into a line memory while simultaneously being exposed to incoming light. The figure summarizes the architecture of most consumer-grade camera shutters, which implement line-scan frame acquisition. This design is motivated first by the per-line read speed, which is not fast enough in circuitry to achieve the effect of a global shutter, and second by the difficulty of designing a CMOS sensor that can simulate a true optical shutter.
For the latter, a storage element within each pixel is needed to hold the charge until it is time to read it out. The figure above assumes a simple setup of a rolling-shutter camera pointed at a single point source that blinks on and off with a period T_{tx} . The time needed for the access driver to ask for another batch of lines is denoted t_r , while the time required for dumping the collected data is the line-scan acquisition time t_{\alpha} . The actual amount of time that each sensor is exposed to light is t_c , so the total acquisition time in practice is

t_d = t_r + t_c + t_{\alpha},

and the reported frame rate of the camera is

fps_{rx} = \frac{1}{T_{rx}}.

Each frame may contain many batches of line scans, each of duration t_d , which occur sequentially n times and also overlap in parallel, as shown in the following figure. In other words, a typical rolling-shutter camera with a reported frame period of T_{rx} actually performs n readouts of length t_d for a blinking point source. The effective time that a pixel is exposed to light during a line scan,

T_s = \frac{T_{rx}}{n} = \frac{1}{fps_{rx} \cdot n},

depends on the exact number of readouts n . For example, for n=5 on a 15-fps off-the-shelf camera, the effective rate is 75 fps, since the effective "shutter" is 0.0667/5 = 0.0133 seconds. If the single blinking point source has a period of T_{tx}=0.02 seconds, a 15-fps rolling-shutter camera can therefore resolve the blinking light, because T_s is shorter than T_{tx} .

---- Metamerism and Human Visual System ----

It is clear that the HVS cannot be modeled in the same way as the camera. The frequency of visible light is extraordinarily high, on the order of hundreds of terahertz, compared with the fastest nervous impulses in our brain, which are only on the order of kilohertz (Gregory, 1979). In the early 1800s, Thomas Young proposed that humans possibly have three color receptors and that perhaps all colors come from mixing combinations of these three.
In Young's experiment, three distinct colored lights (red, green and blue), spaced apart from one another, are projected in order to create any spectral hue. Since then, these three primary colors have been referred to as "tristimulus" values. The CIE RGB color space illustrated in the figure on the left is generated from user observations. On one side, a test color is projected, while on the other side three orthogonal colors (red, green and blue), each adjustable in brightness, are shown, and users are asked to match the two sides. In this way, based on experimental data, the CIE color space shows how each hue can be broken down.

Apart from color-space research, VRCodes also draw on psychophysics research. The temporal frequency at which an intermittent light appears steady to a human observer is called the flicker fusion threshold. Depending on the color wavelength as well as the amplitude and depth of the modulation, this frequency may vary around 90Hz (Gregory, 1979). The flicker fusion threshold is relatively high when high-contrast pairs of colors are flickered alternately; when low-contrast pairs of colors are used, the threshold is lower. VRCodes are tested on both low- and high-refresh screens, 60 and 120Hz respectively. On the former, low-contrast pairs of colors are shown so that the HVS does not perceive any flickering. On high-refresh screens, by contrast, high-contrast pairs are shown, offering higher data capacity.

By picking a line segment from the CIE color space, we get two colors: Color A is ( c_{ar} , c_{ag} , c_{ab} ) and Color B is ( c_{br} , c_{bg} , c_{bb} ). The perceived color when alternating Color A and Color B beyond the appropriate flicker fusion frequency may be predicted using an average:

(c_{pr}, c_{pg}, c_{pb}) = \left(\frac{c_{ar}+c_{br}}{2}, \frac{c_{ag}+c_{bg}}{2}, \frac{c_{ab} + c_{bb}}{2} \right).
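The averaging rule above can be illustrated with a short Python sketch. The function names `perceived_color` and `to_8bit` are hypothetical and not part of VRCodes; the sketch also assumes that the formula is applied directly to 8-bit RGB values, without modeling display gamma or any per-display calibration.

```python
import math


def perceived_color(color_a, color_b):
    """Predict the fused color as the channel-wise mean of two RGB colors.

    Assumes both colors alternate faster than the flicker fusion threshold.
    """
    return tuple((a + b) / 2 for a, b in zip(color_a, color_b))


def to_8bit(color):
    """Round each channel half-up to the nearest displayable 8-bit level."""
    return tuple(math.floor(c + 0.5) for c in color)


red, cyan = (255, 0, 0), (0, 255, 255)
green, magenta = (0, 255, 0), (255, 0, 255)

# Both complementary pairs fuse to the same mid gray.
print(to_8bit(perceived_color(red, cyan)))       # (128, 128, 128)
print(to_8bit(perceived_color(green, magenta)))  # (128, 128, 128)
```

Two different color pairs producing the same fused gray is what allows a camera to distinguish symbols that look identical to a human viewer.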
For example, when Color A is set to (0, 0, 255) and Color B is set to (255, 255, 0), the resulting perceived color is (128, 128, 128) after rounding, which lies at the center of the color space and is the average of Color A and Color B. One limitation of VRCodes is that the number of perceived colors that can be produced is determined by the CIE chart and the spectral range of the display. The following table shows how the candidate metamers change for different flickering frequencies. Colors at the center of the graph are easier to produce with different combinations than colors on the periphery. The strategy followed here is that, for any perceived color C_p , it is desirable to pick C_a and C_b on a line segment such that their Euclidean distance is maximized while they still average to C_p :

\max_{(c_{pr}, c_{pg}, c_{pb})} |C_a - C_p| + |C_b - C_p|.

For each display, a unique CIE chart is generated. It differs according to color, flickering frequency, and make of television. A variety of users with different color perception take part in the generation of each CIE chart, and the results are averaged across the user study. Similarly, the flicker fusion threshold may also vary by up to 50% (Gregory, 1979).

---- Encoding Position and Data ----

In the figure on the right, an active video screen is shown to both a camera and a human viewer. On the active video screen, VRCodes are shown at a frame rate higher than the flicker fusion frequency. The camera has a rolling shutter and can therefore resolve and decode the VRCodes.

A VRCode is composed of a sequence of symbols. Each symbol is assigned a pair of colors that produces a metamer matching the original desired color. The effective data capacity therefore increases with the number of available pairs that can produce the desired color. On most off-the-shelf displays, the desired color is a gray background.
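The pair-selection strategy described above can be sketched as a brute-force search in Python. This is only an illustration under stated assumptions, not the authors' procedure, which relies on per-display charts from user studies: the function `metamer_pairs` and its `levels` parameter are hypothetical, colors are treated as plain RGB triples in the range 0 to 255, and the target gray is taken as the exact midpoint 127.5 per channel.

```python
import itertools
import math


def metamer_pairs(target, levels=(0, 255)):
    """List color pairs (c_a, c_b) whose average equals `target`.

    Candidates for c_a are drawn from `levels` on each channel; c_b is then
    fixed by the averaging constraint c_b = 2 * target - c_a. Pairs are
    sorted by |c_a - target| + |c_b - target|, largest first.
    """
    pairs = []
    for c_a in itertools.product(levels, repeat=3):
        c_b = tuple(2 * t - a for t, a in zip(target, c_a))
        if not all(0 <= v <= 255 for v in c_b):
            continue  # partner color is outside the displayable range
        if c_a > c_b:
            continue  # skip the mirrored copy of a pair already listed
        spread = math.dist(c_a, target) + math.dist(c_b, target)
        pairs.append((spread, c_a, c_b))
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs


gray = (127.5, 127.5, 127.5)
for spread, c_a, c_b in metamer_pairs(gray):
    print(c_a, c_b, round(spread, 1))
```

With only the extreme levels 0 and 255 as candidates, the search yields four equally distant pairs for gray (black-white, blue-yellow, green-magenta and cyan-red), which illustrates why a central color such as gray offers several alternative encodings.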
This figure shows how a pair of colors is assigned a symbol representing logical "1" and a consecutive solid color is assigned a symbol representing logical "0". Here b is the spatial width of each color band as it is imaged on the camera sensor, and T is the spatial width of each cell (the unit of encoding). More data capacity can be achieved depending on the number of acceptable color combinations that the camera can resolve. A basic Hamming code scheme (18004:2006, Information technology. Automatic identification and data capture techniques. Bar code technology. QR code.) for blocks of size 5 is applied to an incoming sequence of raw data. The resulting encoded sequence is then mapped to alternating and solid colors. Relative positioning is demonstrated by employing a checkerboard of alternating colors and solid gray colors to measure stability. In the following, the encoding process is demonstrated in an actual design.

The decoding process resembles a typical radio frequency (RF) decoder. Each captured frame is processed in real time as part of a video-processing loop. In each iteration of the loop, a chunk of bits is stored and passed up to the application.

---- Preprocessing ----

Each frame is first preprocessed with color equalization. In contrast to the decoding of QR codes, a binary threshold is not sufficient, since there may be two color candidates for a single threshold. This is called an artifact in the study of unsynchronized barcodes by T. Langlotz and O. Bimber ("Unsynchronized 4D barcodes", in ISVC, 363-374, 2007). In this system setup a binary scheme is used, so a shortcut filter decides for each pixel between hue and no hue.

---- Natural Marker Assistance ----

The black edges of the screen are used to define a search region for the encoded sequence area. Natural markers are used to reduce the processing time. The entire frame can also be used as a pilot sequence.

---- Homography ----

The homography is determined from the expected shape of the marker and defines the transform applied to all real-time frames until another homography is found in the background. The new transform is then applied to each of the following frames, which keeps the processing real-time.

---- Sampling ----

Once the transform has been chosen, a sampling grid is created from which each value can be read out of the 2D frame. These values are important because they determine the confidence of the estimated value compared to the original value; multiple samples from each cell can boost this confidence, as shown in the next section. Each analog value is assigned a symbol according to the threshold value c_{th} .

---- Decode ----

A Reed-Solomon decoder is used to decode each sequence. The decoded sequence is then passed to the application. In the case of positioning, there is no decoding process, and the sampled points are used to form the relative positioning vectors.

---- Results ----

VRCodes are tested on a real testbed consisting of a Samsung LN46 1920x1080 120Hz 3D television as well as 60Hz Apple iPads. The results shown in the following figures are taken with a digital SLR camera with adjustable global shutter speed. On a 120Hz display, higher contrasts are allowed, so an encoding scheme with more color pairs can be achieved. On a 60Hz display, by contrast, only lower contrast levels and fewer combinations of colors are allowed, and lower contrast corresponds to lower signal levels that make decoding more challenging. The following figure on the left shows the result of using 4 pairs of colors that the user perceives as the same gray. Gray has many such combinations (e.g., black-white, red-blue, green-magenta, blue-yellow). The following figure studies the effects of the user moving away from the screen ( d=1m , d=2m , d=3m ).
The absolute width of each band in pixels remains the same, even though the bands appear to become narrower. This is due to the sampling size as the distance d between the camera and the screen increases. The only difference from the geometric properties used in 2D barcodes is that VRCodes appear unobtrusive to the human eye.

The next figure on the right shows a good way of creating a position marker by tiling different cells adjacent to one another. Regardless of the position of the phone camera, the bands are always aligned with the phone. A checkerboard pattern is used to deduce relative orientation when placed adjacent to a reversed-sequence color pattern. Although the transmitting surface still looks like a single texture to the human viewer, the marker can be used to provide relative orientation and positioning. The Hough line detection algorithm is used to detect relative edges in the pattern. When a line segment is found, it is returned as a vector along with a relative angle. By averaging the collected vectors and angles, we calculate the position of the phone relative to the active screen. The accelerometer of the mobile phone is used to distinguish between upside-down and upright positions.

The following figure reports the results of a stability experiment on position-tracking data using a modified Hough line transform. The background appears as an ordinary gray screen to the human viewer, whereas with the rolling-shutter camera a checkerboard with trackable features appears. The x-axis shows the frame number. Stable tracking of movement from "still" to "rotate" is also demonstrated, suggesting that VRCodes are suitable as a visual marker.

----
=Advanced Material=
----
Improving the decoding of VRCodes depends on a clear understanding of the block diagram of the free-space optical system.
Currently, there are many visible light communication (VLC) broadcast configurations that explore the potential of establishing a data communication link, though they use custom free-space optical hardware. VRCodes, on the other hand, use consumer hardware to establish data transmission over a channel model, and in that way propose a communication system in which a camera and an LCD display communicate using visible light.

---- Sensor Sampling Confidence ----

Each reported sample is an (R,G,B) value represented as c_i = (r_i,g_i,b_i) , so each color channel is recovered independently. A single reported value may differ from the actual source due to noise from sources such as the camera's thermal noise, lack of access to raw sensor data, and errors in spatial sampling. In the VRCodes context, the noise level of the camera is modeled as Gaussian with variance \sigma^2 in order to derive a decision mechanism for estimating the transmitted color. Let h and w denote the vertical and horizontal resolution respectively. The rolling-shutter CMOS camera has an orientation, and the horizontal dimension is defined as the one aligned with the scanlines. The effective camera resolution for each line-scan readout is

\mathrm{sampling~pixels} = \frac{h}{n} \cdot w

For parameters h=1280 , w=960 and n=5 , each line scan may have on the order of \frac{1280}{5} \cdot 960 = 245760 samples of the same source. Multiple observations of the same point source may boost the effective confidence and result in a closer estimate of the original value. For example, let c_1, \dots, c_k be multiple observations in a binary scheme, let x^{a} = (x_1^{a}, \dots, x_m^{a}) be the amplitude values corresponding to symbol a , and let x^{b} = (x_1^{b}, \dots, x_m^{b}) be the amplitude values corresponding to symbol b , for a threshold c_{th} .
The overall confidence is boosted by applying the following decision rule:

\text{if } \sum_{i=1}^{k}\frac{c_i}{\sigma_i^2} \leq c_{th}, \text{ then symbol } a, \text{ otherwise symbol } b.

The overall confidence of the estimated (R,G,B) value is thus boosted by the multiple observations. The number of measurements depends on the range of the camera sensors.

---- Transmitter ----

The following figure summarizes the basic transmitter chain. It considers a scenario in which incoming bits are first compressed and then split into several frames. Forward error-correcting codes are also used to protect the bits through the lossy channel. Finally, an additional block in the transmitter chain creates a feedback system that determines the physical placement of the protection bits.

---- Receiver ----

The figure on the left describes the receiver chain. The preprocessing elements are based on those used in image processing. The postprocessing elements are borrowed from similar decoding mechanisms used in RF. Spatial tracking is used to determine the location of the data in the scene. Corner search is implemented via a modified corner algorithm. When there are many corners in the scene, each candidate is compared to a large quadrilateral generated from the second frame. This is done with a fast corner detector that takes as input the incoming frame and the sampled frames. The quadrilateral from the second frame is obtained by repeatedly blurring and low-pass filtering the second sampling frame and then converting it to a high-contrast black and white image. After discovering all the white regions in the scene, a "square score" is calculated for each:

\left(\frac{\min \left(\sqrt{\mathrm{area}}, \frac{\mathrm{perimeter}}{4} \right)}{\max \left( \sqrt{\mathrm{area}}, \frac{\mathrm{perimeter}}{4} \right)}\right)

All the distances are calculated from the corner candidates in the first frame to the centroid of the square found in the next frame.
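The "square score" above compares two estimates of a region's side length, one from its area and one from its perimeter; for a perfect square both agree and the score is 1. A minimal Python sketch follows. The function name `square_score` and the handling of degenerate regions are assumptions made for illustration, not part of the described receiver.

```python
import math


def square_score(area, perimeter):
    """Ratio of the two side-length estimates sqrt(area) and perimeter / 4.

    Returns a value in [0, 1]; 1.0 means the region's area and perimeter
    are exactly those of a square.
    """
    side_from_area = math.sqrt(area)
    side_from_perimeter = perimeter / 4
    larger = max(side_from_area, side_from_perimeter)
    if larger == 0:
        return 0.0  # assumed convention for an empty region
    return min(side_from_area, side_from_perimeter) / larger


print(square_score(100, 40))  # 10 x 10 square -> 1.0
print(square_score(100, 50))  # 20 x 5 rectangle -> 0.8
```

Because the score depends only on area and perimeter, it can be evaluated cheaply for every white region before the more expensive corner matching.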
The four corners whose distances are closest to one another are taken as the corner calibration points for this block. Finally, the corners for this round are taken to be the brightest pixel points associated with these corner calibration points. Once the scene is found, it is restored with an image matrix transformation: the corners found form a transfer function, and a homogeneous inversion is applied. The following frames are cropped according to the corners and transform coordinates found, and are passed through the rest of the receive chain until another dark calibration frame is found. The image is then recovered by adaptive equalization and passed through a high-contrast filter with contrast limits of 0.2, which converts it to a high-contrast black and white image. Timing recovery is performed by detecting the centroid of each of the points using standard logic functions.

Sampling is done over the time-recovered frame. It takes as input an incoming color image and slices it into three parts, each of which is treated as a grayscale image. Each slice is adaptively equalized and then adjusted for high contrast. The result is then sampled at the points found by the timing recovery.

The following figure illustrates the bit-error-rate (BER) curve with respect to distance when encoding in all color channels at 4.96Mbits/frame/screen in a 3x3 array of cameras and screens. The total transmit size of each frame is 1599x1035 pixels. In the array setup, each camera is focused on one display. A square black and white checkerboard carrying a throughput of 7.86Mbits is used for angle evaluation. The next figure depicts BER as a function of viewing angle at a fixed distance of about 2m. As the angle increases, BER also increases, up to 10^{-2} . Finally, the last figure on the right shows that even though the channel model is analytical, the resulting BERs still exhibit randomness.
However, one can see that the noise model for a bit is a simple Gaussian noise model, which can be improved further. The experiments described above are based on commodity hardware that can be used for many interactive scenarios. Depending on the application and the needs of the network, there are still many improvements that could result in higher bit rates.

Current state-of-the-art techniques for embedding information use all the physical dimensions of space, time and wavelength, and require custom software and hardware deployments that also occupy valuable visual real estate. VRCodes revolutionize the areas of ubiquitous computing and human interaction by proposing a new, flexible, unobtrusive design that is decodable by a well-designed software decoder. In the future, the VRCodes design may also benefit other applications in augmented reality, including camera calibration, hidden debug screens and novel screen-to-camera interfaces.

----
=References=
----